We will try to explain why the model made specific predictions on individual examples.
import pickle
import dalex as dx # XAI library
import matplotlib.pyplot as plt
%matplotlib inline
import scikitplot as skplt
import pandas as pd
import numpy as np
np.random.seed(1)
train_data = pd.read_csv('../Data/preprocessed2-train-bank-data.csv', sep=';')
test_data = pd.read_csv('../Data/preprocessed2-test-bank-data.csv', sep=';')
train_data.head()
|   | age | campaign | contacted.in.previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | job_admin. | job_blue-collar | ... | contact_cellular | contact_telephone | poutcome_failure | poutcome_nonexistent | poutcome_success | month_sin | month_cos | day_sin | day_cos | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.322632 | 1.000000 | 0.0 | 1.000000 | 0.612813 | 0.390735 | 0.970664 | 1.000000 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | -0.500000 | -0.866025 | -5.877853e-01 | -0.809017 | 0.0 |
| 1 | 0.552137 | 0.000000 | 0.0 | 1.000000 | 0.422695 | 0.724448 | 0.970664 | 1.000000 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | -0.866025 | -0.500000 | -9.510565e-01 | 0.309017 | 0.0 |
| 2 | 0.636314 | 0.495659 | 0.0 | 0.854914 | 0.644951 | 0.710269 | 0.932652 | 0.710922 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.500000 | -0.866025 | -2.449294e-16 | 1.000000 | 0.0 |
| 3 | 0.798640 | 0.000000 | 0.0 | 0.432174 | 0.331438 | 0.428275 | 0.714556 | 0.743803 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | -0.500000 | 0.866025 | 9.510565e-01 | 0.309017 | 0.0 |
| 4 | 0.552137 | 0.495659 | 0.0 | 1.000000 | 0.612813 | 0.390735 | 0.970309 | 1.000000 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | -0.500000 | -0.866025 | -9.510565e-01 | 0.309017 | 0.0 |
5 rows × 44 columns
X_train, y_train = train_data.drop('y', axis=1), train_data['y']
X_test, y_test = test_data.drop('y', axis=1), test_data['y']
"""Loading pickled Random Forest Model"""
with open('../Models/Serialized_models/random_forest_classifier_gs.pickle', 'rb') as file:
    rfc = pickle.load(file)
print(rfc)
RandomForestClassifier(class_weight='balanced', criterion='entropy',
max_depth=9, min_samples_leaf=3, n_estimators=80,
random_state=1)
explainer = dx.Explainer(rfc, X_test, y_test)
Preparation of a new explainer is initiated
  -> data              : 7975 rows 43 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 7975 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x00000275AB336288> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.135, mean = 0.369, max = 0.977
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.974, mean = -0.26, max = 0.844
  -> model_info        : package sklearn

A new explainer has been created!
C:\Users\frane\anaconda3\envs\kaggleproject\lib\site-packages\sklearn\base.py:451: UserWarning: X does not have valid feature names, but RandomForestClassifier was fitted with feature names
examples = np.random.randint(low=0, high=X_test.shape[0]-1, size=4)
for i in examples:
    shap = explainer.predict_parts(X_test.loc[i], type='shap', B=44)
    shap.plot()
SHAP plots help us understand why the model made a specific prediction in each case: they show which features had the biggest impact on the classification and whether each impact was positive or negative.
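To make the idea of per-feature contributions concrete, here is a minimal sketch on a small synthetic dataset (not our bank data). It uses a single fixed feature ordering, in the spirit of a break-down explanation; `dalex` with `type='shap'` instead averages contributions over `B` random orderings, so the numbers would differ, but the decomposition principle is the same.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative synthetic data, not the bank marketing dataset
X, y = make_classification(n_samples=300, n_features=4, random_state=1)
model = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

obs = X[0]
baseline = model.predict_proba(X)[:, 1].mean()   # average model prediction

# Fix the observation's features one at a time (in index order) and track
# how the mean prediction moves from the baseline toward the final prediction.
contributions = {}
data = X.copy()
prev = baseline
for j in range(X.shape[1]):
    data[:, j] = obs[j]                          # fix feature j at its observed value
    cur = model.predict_proba(data)[:, 1].mean()
    contributions[f"x{j}"] = cur - prev          # signed contribution of feature j
    prev = cur

print(f"baseline = {baseline:.3f}")
for name, c in contributions.items():
    print(f"{name}: {c:+.3f}")
print(f"final prediction = {model.predict_proba(obs.reshape(1, -1))[0, 1]:.3f}")
```

By construction the contributions telescope: baseline plus the sum of all contributions equals the model's prediction for the observation, which is exactly the additivity property the SHAP bar plots visualize.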
A partial dependence profile shows the model's average predicted probability as a function of a single variable.
pdp = explainer.model_profile()
pdp.plot()
Calculating ceteris paribus: 100%|█████████████████████████████████████████████████████| 43/43 [00:06<00:00, 7.13it/s]
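Under the hood, a partial dependence profile is simple to compute: fix one feature at a grid value for every row, average the predictions, and repeat across the grid. A minimal sketch on synthetic data (the `partial_dependence` helper is our own illustration, not the `dalex` API):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative synthetic data, not the bank marketing dataset
X, y = make_classification(n_samples=300, n_features=4, random_state=1)
model = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

def partial_dependence(model, X, feature, grid):
    """Average predicted probability when `feature` is fixed at each grid value."""
    profile = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v                       # ceteris paribus: only this feature changes
        profile.append(model.predict_proba(Xv)[:, 1].mean())
    return np.array(profile)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
pd_profile = partial_dependence(model, X, feature=0, grid=grid)
print(pd_profile.round(3))
```

`explainer.model_profile()` does the same averaging of ceteris-paribus profiles for every feature, which is why its progress bar above counts through all 43 columns.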
permutation_importance = explainer.model_parts(loss_function='rmse')
permutation_importance.plot()
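Permutation importance measures how much a chosen loss worsens when a feature's values are shuffled, breaking its relationship with the target. A minimal hand-rolled sketch on synthetic data, using AUC drop as the score (our notebook uses `dalex` with RMSE loss; the mechanism is the same):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Illustrative synthetic data, not the bank marketing dataset
X, y = make_classification(n_samples=400, n_features=5, n_informative=2, random_state=1)
model = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

rng = np.random.default_rng(1)
base_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

importance = {}
for j in range(X.shape[1]):
    Xp = X.copy()
    rng.shuffle(Xp[:, j])            # destroy the feature/target relationship
    perm_auc = roc_auc_score(y, model.predict_proba(Xp)[:, 1])
    importance[f"x{j}"] = base_auc - perm_auc    # AUC drop = importance

for name, imp in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {imp:.3f}")
```

Features whose shuffling barely moves the score contribute little to the model; in practice the shuffle is repeated several times and the drops averaged to reduce noise.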
skplt.metrics.plot_lift_curve(y_test, rfc.predict_proba(X_test))
plt.legend(loc='upper right')
plt.show()
For example, we can see that contacting the customers with the top 20% of predicted probabilities yields more than 3 times as many responders as selecting customers at random.
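The lift figure is easy to verify by hand: sort customers by predicted probability, take the top fraction, and divide their positive rate by the overall base rate. A sketch on synthetic scores (illustrative data, not our model's predictions):

```python
import numpy as np

rng = np.random.default_rng(1)
y_true = rng.random(1000) < 0.1                  # ~10% positive class
scores = rng.random(1000) + 0.5 * y_true         # positives score higher on average

def lift_at(y_true, scores, fraction):
    """Positive rate in the top `fraction` of scores, relative to the base rate."""
    k = int(len(scores) * fraction)
    top = np.argsort(scores)[::-1][:k]           # indices of the top-scored customers
    return y_true[top].mean() / y_true.mean()

print(f"lift@20% = {lift_at(y_true, scores, 0.20):.2f}")
```

A lift of 1.0 means the model is no better than random selection at that depth; the curve plotted above traces this ratio for every fraction.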
skplt.metrics.plot_cumulative_gain(y_test, rfc.predict_proba(X_test))
plt.legend(loc='lower right')
plt.show()
In our case it shows how many responders we will reach after contacting a given share of the customers with the highest predicted probabilities.
For example, after contacting the customers with the top 20% of predicted probabilities, we will reach over 60% of the customers who will place the advertised deposit.
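Cumulative gain differs from lift only in the numerator: instead of the positive rate within the selected group, it is the share of all positives captured by the group. A sketch on synthetic scores (illustrative data, not our model's predictions):

```python
import numpy as np

rng = np.random.default_rng(1)
y_true = rng.random(1000) < 0.1                  # ~10% positive class
scores = rng.random(1000) + 0.5 * y_true         # positives score higher on average

def gain_at(y_true, scores, fraction):
    """Share of all positives captured in the top `fraction` of scores."""
    k = int(len(scores) * fraction)
    top = np.argsort(scores)[::-1][:k]
    return y_true[top].sum() / y_true.sum()

print(f"gain@20% = {gain_at(y_true, scores, 0.20):.2f}")
```

Contacting everyone always gives a gain of 100%, so the curve ends at 1.0; the steeper it rises at small fractions, the more efficiently the model concentrates responders at the top of the ranking.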